WineQualityWhites-EDA

Univariate Plots Section

## [1] 4898   12
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##                                                                     
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##                                                            
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##                                                                   
##     quality    
##  6      :2198  
##  5      :1457  
##  7      : 880  
##  8      : 175  
##  4      : 163  
##  3      :  20  
##  (Other):   5
The dataset contains observations of 11 chemical properties (acidity, sugar, alcohol, etc.) for 4898 white wines. Each wine also has a quality score on a scale of 0 to 10.

The quality scores are approximately normally distributed. Average wines (scores of 5 to 7) are the majority, a few wines are of poor quality, and wines of very high quality are extremely rare. So which factors is wine quality related to? The characteristics people usually notice most when drinking are alcohol concentration and taste (sweet and sour).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20
After adjusting the bin width and the x axis, most white wines have an alcohol concentration between 9.5% and 11%. The median is 10.40 and the mean is 10.51, which are very close.
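
The effect of choosing a bin width can be sketched in base R. This is only an illustration, not the code behind the plot described above: it uses just the first ten alcohol values printed by str() earlier, and the 0.5 bin width and the names `alcohol_sample` and `bin_breaks` are assumptions.

```r
# First ten alcohol values as printed by str() above (not the full dataset).
alcohol_sample <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)

# Assumed bin width of 0.5 (% by volume), covering the observed range 8.0 to 14.2.
bin_breaks <- seq(8, 14.5, by = 0.5)

# plot = FALSE returns the bin counts without drawing the histogram.
alcohol_hist <- hist(alcohol_sample, breaks = bin_breaks, plot = FALSE)

# One row per bin: left edge, right edge, and number of wines in the bin.
bin_table <- data.frame(
  from  = head(alcohol_hist$breaks, -1),
  to    = tail(alcohol_hist$breaks, -1),
  count = alcohol_hist$counts
)
print(bin_table)
```

A narrower `by` value gives more, thinner bars; a wider one smooths the shape but can hide detail.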

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200
The content of non-volatile acid (tartaric acid) is approximately normally distributed, with a median of 6.8 g/dm^3 and a mean of 6.855 g/dm^3. In a box plot with wine quality on the x axis and fixed acidity on the y axis, the vast majority of wines have a non-volatile acid content within a range of about 6 g/dm^3, between 3 and 9 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000
The histogram of volatile acidity is right-skewed (the mean, 0.2782, is above the median, 0.26, and the tail extends to 1.1), so I transform the data using a log transform.
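
The mean of volatile.acidity (0.2782) sits above its median (0.26), which points to a long right tail. The sketch below shows how a log10 transform pulls such a tail in. As a stand-in for the real column it uses the five-number summary printed above as if it were a five-point sample, which is an assumption made only for illustration; the helper name `skew_index` is also hypothetical.

```r
# Stand-in sample: min, Q1, median, Q3, max of volatile.acidity from summary().
va_sample <- c(0.08, 0.21, 0.26, 0.32, 1.10)

# Scale-free skew indicator: (mean - median) / sd.
# Positive values suggest a long right tail, negative a long left tail.
skew_index <- function(x) {
  (mean(x) - median(x)) / sd(x)
}

skew_raw <- skew_index(va_sample)
skew_log <- skew_index(log10(va_sample))

cat("skew index, raw values:  ", round(skew_raw, 3), "\n")
cat("skew index, log10 values:", round(skew_log, 3), "\n")
```

On the log10 scale the indicator moves much closer to zero, which is why the transformed histogram is easier to read.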

## [1]  7 12
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600
Most wines have a citric acid content between 0.2 and 0.4 g/dm^3. For 19 wines the citric acid content is 0, and for 7 it is over 1 g/dm^3.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.130   6.890   7.405   7.467   7.960  14.960

I add a total_acidity column (fixed acidity + volatile acidity + citric acid) to the dataset. Most white wines have a total acidity between 7 and 8 g/dm^3. Faceting by quality with facet_grid gives no indication that wines of any particular quality have a lower or higher acidity.
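
A minimal sketch of how such a column can be derived is shown below. It uses only the first three rows of the acid columns as printed by str() earlier; the data frame name `whites_sample` is an assumption, and the real analysis would apply the same sum to the full data frame.

```r
# First three rows of the three acid columns, as printed by str() above.
whites_sample <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28),
  citric.acid      = c(0.36, 0.34, 0.40)
)

# Total acidity is the row-wise sum of the three acid measurements (g/dm^3).
whites_sample$total_acidity <- with(
  whites_sample,
  fixed.acidity + volatile.acidity + citric.acid
)

print(whites_sample)
```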
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

## [1] 18 13
Most residual.sugar values are less than 20 g/dm^3. I transformed the long-tailed data to better understand the distribution of residual sugar. The transformed residual sugar distribution appears bimodal, with residual.sugar peaking around 2 and again around 10.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Most chlorides values are between 0.035 and 0.05 g/dm^3. Some are much larger; the maximum is 0.346 g/dm^3.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

The histogram of free.sulfur.dioxide is approximately normal with a long right tail. The median is 34.00, the mean is 35.31, and the maximum is 289.00. Most free sulfur dioxide values are between 25 and 50 mg/dm^3; a few are larger than 100 mg/dm^3.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

The distribution of total.sulfur.dioxide is approximately normal, with a median of 134.0 and a mean of 138.4. It also has some extreme values: the minimum is 9.0 and the maximum is 440.0. Most total.sulfur.dioxide values are between 105 and 170 mg/dm^3.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    78.0   100.0   103.1   125.0   331.0

The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. I also want to know the influence of bound sulfur dioxide, so I create a column named 'bound.sulfur.dioxide' (total.sulfur.dioxide - free.sulfur.dioxide). In the probability density curves, the distribution for high-quality wines appears shifted to the left.
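
The derived column and a density estimate can be sketched in base R as follows. The sketch uses only the first four rows of the two sulfur dioxide columns as printed by str() earlier; the name `so2_sample` is an assumption, and the per-quality grouping used for the curves described above is left out.

```r
# First four rows of the sulfur dioxide columns, as printed by str() above.
so2_sample <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186)
)

# Bound SO2 is whatever part of the total is not free (mg/dm^3).
so2_sample$bound.sulfur.dioxide <-
  so2_sample$total.sulfur.dioxide - so2_sample$free.sulfur.dioxide

# Sanity check: free SO2 should never exceed total SO2.
stopifnot(all(so2_sample$bound.sulfur.dioxide >= 0))

# Kernel density estimate of the new column (default Gaussian kernel).
bound_density <- density(so2_sample$bound.sulfur.dioxide)

print(so2_sample)
```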
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

The distribution of density looks approximately normal. Most density values are between 0.991 and 0.996 g/cm^3; the maximum is 1.039 g/cm^3.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The histogram of pH is approximately normal. The median is 3.18, the mean is 3.188, and the maximum is 3.820.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

## [1] 4898   14
Most white wines have a sulphates value between 0.4 and 0.55 g/dm^3.

Univariate Analysis

What is the structure of your dataset?

Number of Instances: white wine - 4898
Number of Attributes: 11 + output attribute(quality)
variables:
  fixed acidity (tartaric acid - g / dm^3)
  volatile acidity (acetic acid - g / dm^3)
  citric acid (g / dm^3)
  residual sugar (g / dm^3)
  chlorides (sodium chloride - g / dm^3)
  free sulfur dioxide (mg / dm^3)
  total sulfur dioxide (mg / dm^3)
  density (g / cm^3)
  pH
  sulphates (potassium sulphate - g / dm^3)
  alcohol (% by volume)
Output variable (based on sensory data): 
  quality (score between 0 and 10)
The white wine data structure contains 11 input variables describing the chemical composition of white wine, such as acidity, sugar, chlorides, sulfur dioxide, alcohol content (% by volume), pH, and sulphates, plus the quality score (between 0 and 10).

What is/are the main feature(s) of interest in your dataset?

I am interested in how the pH, residual sugar, and alcohol concentration of white wine influence its quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The acid and chloride content of the wine should influence the pH.

Did you create any new variables from existing variables in the dataset?

I created two new variables from existing variables: total_acidity and bound.sulfur.dioxide.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The volatile.acidity values range from 0.08 to 1.1 and the distribution is right-skewed, so I applied a log transform to make it easier to observe. I did the same for residual.sugar, transforming the long-tailed data to better understand its distribution.

Bivariate Plots Section

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##                                                                     
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##                                                            
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##                                                                   
##     quality     total_acidity    bound.sulfur.dioxide
##  6      :2198   Min.   : 4.130   Min.   :  4.0       
##  5      :1457   1st Qu.: 6.890   1st Qu.: 78.0       
##  7      : 880   Median : 7.405   Median :100.0       
##  8      : 175   Mean   : 7.467   Mean   :103.1       
##  4      : 163   3rd Qu.: 7.960   3rd Qu.:125.0       
##  3      :  20   Max.   :14.960   Max.   :331.0       
##  (Other):   5
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"             
## [13] "total_acidity"        "bound.sulfur.dioxide"

Check the correlation coefficients between variables with the ggcorr function.
Alcohol was negatively correlated with density, with a correlation coefficient of -0.78. Residual sugar was positively correlated with density, with a correlation coefficient of 0.84. total.sulfur.dioxide was positively correlated with density, with a correlation coefficient of 0.53.
fixed.acidity was negatively correlated with pH, with a correlation coefficient of -0.43.
Next, explore the relationships between pairs of variables, such as acidity and pH, and alcohol, residual.sugar, total.sulfur.dioxide, and quality against density.
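
For readers without the ggcorr plot at hand, the same kind of pairwise Pearson coefficients can be obtained as a plain matrix with base R's cor(). The sketch below is an illustration under stated assumptions: it uses only the first five rows of three columns as printed by str() earlier (density is truncated after five values there), so the numbers it prints will not match the full-dataset coefficients; only the signs are expected to agree. The name `whites_sample` is hypothetical.

```r
# First five rows of three columns, as printed by str() above.
whites_sample <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

# Pairwise Pearson correlation matrix, rounded for readability.
cor_matrix <- round(cor(whites_sample, method = "pearson"), 2)
print(cor_matrix)
```

The matrix is symmetric with ones on the diagonal, so only the cells above (or below) the diagonal need to be read.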

Acidity is negatively correlated with pH: as fixed acidity increases, the pH decreases. The relationship between fixed acidity and pH appears to be linear. Next, check the relationship between the other acids and pH.

## 
##  Pearson's product-moment correlation
## 
## data:  whites_wine$volatile.acidity and whites_wine$pH
## t = -2.2343, df = 4896, p-value = 0.02551
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.059868312 -0.003912409
## sample estimates:
##         cor 
## -0.03191537
As volatile acidity increases, the pH does not change much.

## 
##  Pearson's product-moment correlation
## 
## data:  whites_wine$citric.acid and whites_wine$pH
## t = -11.614, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1908793 -0.1363671
## sample estimates:
##        cor 
## -0.1637482
The smoothed trend between citric acid and pH is still weak, but it runs in the same direction as the trend between non-volatile acid and pH. Next I explore the relationship between all the acids combined and the pH.

## 
##  Pearson's product-moment correlation
## 
## data:  whites_wine$total_acidity and whites_wine$pH
## t = -33.388, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4531918 -0.4075605
## sample estimates:
##        cor 
## -0.4306513
The trend with total acidity is more continuous: as total acidity increases, the pH decreases. The relationship between total acidity and pH appears to be linear.
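
The t statistics in the Pearson outputs above follow directly from the correlation estimate and the degrees of freedom via t = r * sqrt(df) / sqrt(1 - r^2), where df = n - 2. The sketch below recomputes two of them from the printed values as a cross-check; the helper name `pearson_t` is an assumption, not a base R function.

```r
# t statistic for a Pearson correlation r with df = n - 2 degrees of freedom.
pearson_t <- function(r, df) {
  r * sqrt(df) / sqrt(1 - r^2)
}

# Values copied from the cor.test outputs above.
df_wine    <- 4896
r_total    <- -0.4306513   # total_acidity vs. pH
r_volatile <- -0.03191537  # volatile.acidity vs. pH

t_total    <- pearson_t(r_total, df_wine)
t_volatile <- pearson_t(r_volatile, df_wine)

# Two-sided p-value from the t distribution.
p_total <- 2 * pt(-abs(t_total), df = df_wine)

cat("total_acidity:    t =", round(t_total, 3), "\n")
cat("volatile.acidity: t =", round(t_volatile, 4), "\n")
```

With almost 4900 wines, even a correlation as small as -0.03 reaches p < 0.05, so the size of the coefficient matters more here than the p-value.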

## 
##  Pearson's product-moment correlation
## 
## data:  whites_wine$sulphates and whites_wine$pH
## t = 11.047, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1285063 0.1831580
## sample estimates:
##       cor 
## 0.1559515
Most sulphates values are between 0.4 and 0.5; as sulphates increase, the pH does not change much.
## 
##  Pearson's product-moment correlation
## 
## data:  whites_wine$alcohol and whites_wine$pH
## t = 8.5601, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.09374446 0.14893205
## sample estimates:
##       cor 
## 0.1214321

As alcohol increases, the pH does not change much.
## 
##  Pearson's product-moment correlation
## 
## data:  whites_wine$residual.sugar and whites_wine$pH
## t = -13.847, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2209387 -0.1670352
## sample estimates:
##        cor 
## -0.1941335

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

When residual sugar is less than 5 g/dm^3, the pH is more variable. Once residual sugar exceeds about 5 g/dm^3, the pH decreases slowly.

The pH of white wine is low on both sides. It seems that when the pH is higher, the quality of white wine depends on other chemical components. I need to explore the chemicals that make the pH higher.

White wines with high quality scores also have a high median sulphates value.

As residual sugar content increases, white wine density clearly increases.

As alcohol increases, density decreases.

The residual sugar value is high in the middle. As residual sugar decreases, it seems that the quality of white wine depends on other chemical components. I need to explore the chemicals that make the sugar lower.

White wine with a high quality score has a higher median alcohol content.

White wine with a high quality score is relatively low in density.
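
Comparisons like these rest on per-quality medians, which is what the middle line of each box in a box plot shows. A minimal base R sketch of that computation follows. The data frame `toy` and all of its values are hypothetical and chosen only to show the mechanics; they are not taken from the dataset.

```r
# Hypothetical toy data (illustrative values only, not from the dataset).
toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.2, 9.5, 9.8, 10.0, 10.4, 10.9, 11.2, 11.5, 12.0)
)

# Treat quality as an ordered factor so groups sort from low to high score.
toy$quality <- factor(toy$quality, ordered = TRUE)

# Median alcohol content within each quality score.
median_alcohol <- tapply(toy$alcohol, toy$quality, median)
print(median_alcohol)
```

The same call with density or sulphates in place of alcohol gives the medians behind the other comparisons.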

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

High-quality wines have a higher median pH than average wines. However, wines with high pH values also include some poor-quality wines.

High-quality white wines have lower median sugar levels than average wines, but low-sugar wines also include poor-quality wines. It may be that too much sugar makes a wine too sweet, while too little can cause other problems.

High quality white wines are significantly more alcoholic than regular wines.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I also found that the pH of white wine is associated with acidity and sulphates. It is mainly affected by the non-volatile acids rather than the volatile acids: the higher the non-volatile acid content, the lower the pH. The higher the sulphate content, the higher the pH tends to be.
The density of white wine decreases with the increase of alcohol content.

What was the strongest relationship you found?

The strongest relationship is that as alcohol increases, white wine density decreases and the quality score rises.

Multivariate Plots Section

It can be seen that the quality score of white wine is more distributed in areas with high alcohol degree and low density.

Wines with high quality scores are more distributed in low-sugar, low-density areas.

It can be seen that the quality of the wine is higher in the region with low fixed acid content and higher pH value.

It's the same distribution as the non-volatile acid.

The quality score of white wine is higher in the upper left corner.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Fewer acids increase the pH, and quality scores tend to increase.
The density has a lot to do with the quality of white wine.

Were there any interesting or surprising interactions between features?

The quality score of white wine has a lot to do with its density: as density decreases, the score increases. Residual sugar and alcohol content are the main factors affecting density, and thus affect the quality of the wine.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I did not create any models with my dataset.

Final Plots and Summary

Plot One

Description One

Most of the higher-quality wines fall in the region with low acidity and higher pH. Acidity may affect the pH, which in turn affects the quality of the wine.

Plot Two

Description Two

High-quality wines have higher levels of alcohol than ordinary wines, but poor-quality wines also have higher concentrations of alcohol than ordinary wines, perhaps because other chemical components affect their quality.

Plot Three

Description Three

High quality scores are concentrated in areas with high alcohol content and low density. There is also a dominant relationship between alcohol content and density: as the alcohol content increases, the density decreases.

Reflection

This exploration only begins to examine the relationship between the pH, alcohol content, and residual sugar of white wine, which are the characteristics we pay attention to when tasting wine. Only histograms, box plots, and scatter plots were used, and only some rough, surface-level relationships were found. The quality score was set as an ordered factor.

The factors that affect pH were also explored. Acidic substances were found to be the main drivers of pH, but other relevant factors were not found. The exploration of alcohol and residual sugar found that they were related to the density of white wine, so I went on to explore the relationship between density and quality. The quality of the wine is related to pH and density, but I could not find the balance between pH and density that makes a wine better.

For this kind of data exploration, I think it is necessary to have a clear idea and to be guided by the actual situation, such as the characteristics of wine in reality, rather than exploring blindly. At the same time, one should not start the exploration with a conclusion already in mind.

From this data exploration, future analysis work should start from a clear understanding of the data and a better-organized analytical approach. I should also learn to process data, including grouping and reshaping it, to better discover patterns in the data.